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Abstract. PjTosequencing is among the emerging sequencing techniques, capa- 
ble of generating upto 100,000 overlapping reads in a single run. This technique 
is much faster and cheaper than the existing state of the art sequencing technique 
such as Sanger. However, the reads generated by pyrosequencing are short in size 
and contain numerous errors. Furthermore, each read has a specific position in 
the reference genome. In order to use these reads for any subsequent analysis, the 
reads must be aligned . Existing multiple sequence alignment methods cannot be 
used as they do not take into account the specific positions of the sequences with 
respect to the genome, and are highly inefficient for large number of sequences. 
Therefore, the common practice has been to use either simple pairwise alignment 
despite its poor accuracy for error prone pyroreads, or use computationally expen- 
sive techniques based on sequential gap propagation. In this paper, we develop a 
computationally efficient method based on domain decomposition, referred to as 
pyro-align, to align such large number of reads. The proposed alignment algo- 
rithm accurately aligns the erroneous reads in a short period of time, which is 
orders of magnitude faster than any existing method. The accuracy of the aUgn- 
ment is confirmed from the consensus obtained from the multiple alignments. 



1 Introduction 

Pyrosequencing is among the emerging sequencing techniques developed for deter- 
mining the sequences of DNA bases from a genome. It is capable of generating up 
to 100,000 overlapping reads in a single run. However, multitude of factors, such as 
relatively short read lengths (i.e., as of 2008 an average of 100 — 250 nt compared to 
800 — 1000 nt for Sanger sequencing), lack of a paired end protocol, and limited ac- 
curacy of individual reads for repetitive DNA, particularly in the case of monopolymer 
repeats, present many computational challenges [14] to make pyrosequencing useful 
for biology and bioinformatics applications. 

For over more than a decade, Sanger sequencing has been the cornerstone of genome 
sequencing including that of microbial genomes. Improvements in DNA sequencing 
techniques and the advances in data storage and analysis, as well as developments in 
bioinformatics have reduced the cost to a mere 8000$ — 10000$ per megabase of high 
quality genome draft sequence. However, the need of more efficient and cost effective 
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approaches has led to development of new sequencing technologies such as the 454 
GS20 sequencing platform. It is a non-cloning pyrosequencing based platform that is 
several orders of magnitude faster than the Sanger machines. However, the new tech- 
nology despite its enormous advantage in terms of time and money will not be able to 
replace the current Sanger technology, unless the reads generated are properly aligned 
with respect to the reference genome. 

The key issues associated with the use of pyrosequencing technique are as under: 
Read Length: The read length is expected to be of the order of 100 — 2506p on 
average. This is much shorter than the other state of the art Sanger machines which give 
out consistent read lengths of the order of > 800 — 9006p. 

Orientation: This is generally the case for most of the sequencing technologies. 
Each DNA helix will be broken into the original and its Watson-Crick complement. 
These would be further broken up into pieces, and there is generally no way to reveal 
which of the two is it. The problem is more severe and usually encountered for genome 
reconstruction. 

Errors: Each individual DNA sequence or read is likely to have errors in the form 
of insertions and deletions. It may also have mutations and the pyrosequencer may 
itself make errors. These errors correspond to homopolymer effects, including extension 
(insertions), incomplete extensions (deletions), and carry forward errors (insertions and 
substitutions). Insertions are considered the most common type of error (36% of errors) 
followed by deletions (27%), ambiguous bases, Ns (21%), and substitutions (16%) [28]. 
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Fig. 1. Pairwise alignment of the reads with the reference genome is shown 

For most practical purposes, pyroreads without any post processing are of limited 
use. One of the most widely required tasks as a pre processing step for many applica- 
tions, including haplotype reconstruction [12] [13], analysis of microbial community 
analysis [3], analysis of genes for diseases [2], is the alignment of these reads with the 
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wild type. For important applications such as viral population estimation or haplotype 
reconstruction of various viruses e.g., HIV in a population, scientists usually have the 
information about the wild type genome of the virus. While for other sequencing tech- 
nologies, such as Sanger, simple pair-wise alignment with the wild type may produce 
reasonable multiple alignment, in the case of pyrosequencing, the variation in the hap- 
lotype population compounded with the errors introduced in the reads does not allow 
feasible multiple alignment by simple pair-wise alignment. Fig. 1 depicts simple pair- 
wise alignment of pyrosequence reads with a reference genome. We assert that accurate 
and workable multiple aUgnment is often necessary for a variety of applications and 
statistical packages to work with these pyroreads, as demonstrated in [12] [13] [3] [2]. 

In theory, alignment of multiple sequences can be achieved using pair-wise aUgn- 
ment, each pair getting alignment score. But for optimal alignment the sum of all the 
pair-wise ahgnment scores need to be maximized, which is an NP complete prob- 
lem [15]. Towards this end, dynamic programming based solutions of O(i^) com- 
plexity have been pursued, where N is the number of sequences and L is the average 
length of a sequence. Such accurate optimizations are not practical for large number of 
sequences -as is the case in pyrosequencing- , thus making heuristic algorithms as the 
only feasible option. The literature on these heuristics is vast and includes widely used 
works, such as Notredame et. al. [16], Edgar [18], Thompson et. al. [17], Do et. al. 
[22], and Morgenstem et.al. [20]. These heuristics are complex combination of ad-hoc 
procedures with some flavor of dynamic programming. Despite the usefulness of these 
widely used heuristics, they scale very poorly with increasing number of sequences. 

For multiple alignment of pyroreads, 'out of the box' use of these heuristics is not 
feasible because of two main reasons: 1) the pyrosequencing reads can be very large 
in number (up to 100, 000 usable reads in a single run (with a Roche GS20 platform), 
and 2) the heuristics do not take into account the positions of the reads with respect to 
the reference genome. Additional factors such as short lengths and errors, and the fact 
that these reads have preceding or traihng 'gaps' pose further ahgnment challenges. 
In [12], an alignment technique based on sequential gap propagation has been used. 
This technique is computationally expensive and its aUgnment quaUty decreases with 
the increase in the mutation value. 

In this paper, we present a computationaUy efficient algorithm pyro-align, specif- 
ically designed for multiple alignment of DNA reads obtained from pyrosequencing. 
The proposed algorithm is based on a novel domain decomposition concept, therefore 
it is capable of aligning very large number of pyrosequences. It takes into account the 
position of the reads with respect to the reference genome, and assigns weight to the 
leading and trailing gaps for the reads. 

The objective of our work is to develop a multiple alignment system for small error 
prone reads, such that the errors in the aUgnment are 'highlighted' and the system is 
able to handle large number of reads, as may be expected from pyrosequencing reads. 

We assume that the reads may be generated from one or many genomes, with 'for- 
ward' orientation. We also assume that the reference genome (or its wild type) from 
which the reads are generated is available, as is generally the case for haplotype re- 
construction. In our experiments, we have used HIV-pol gene virus as the reference 
genome (with length of 1970bp) and simulator Readsim [11] to generate these reads. 
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The algorithm uses concepts from domain decomposition and parallel multiple aUgn- 
ment techniques [1,21]. 

For the sake of completeness, let's first formally define the Multiple Sequences 
Aligrmient problem in its generic form, without indulging with the issues such as scor- 
ing functions. Let N sequences be presented as a set S = {81,82, 83, ■ ■ ■ , 8n} and 
let 5 = S'2, 5*3, • • • , S'jy} be the aligned sequence set, such that all the sequences 
in 8 are of equal length, have maximum overlap, and the score of the global map is 
maximum according to some scoring mechanism suitable for the application. 

A perfect multiple alignment for pyroreads would be, that the reads are aligned with 
each other such that the position of the reads with respect to the reference genome is 
conserved; the reads have maximum overlap and are of equal lengths after the align- 
ment, including leading and trailing gaps. 

The intuitive idea behind the proposed pyro-align algorithm is to first place the 
reads in correct orientation with respect to the reference genome and then use progres- 
sive alignment to achieve the final alignment. For efficient progressive alignment, the 
correctly placed reads are reordered according to the starting position, and a computa- 
tionally low complexity similarity metric is extracted from this ordering position. The 
similarity metric is then used to align pairs of aligned reads using a hierarchical de- 
composition strategy. The proposed multiple alignment algorithm takes advantage of 
the pyroreads characteristics and brings in techniques from data structures and parallel 
computing to realize a low complexity solution in terms of time and memory. 

The proposed alignment algorithm, pyro-align, consists of the following two main 
components: 

1. Semi-Global alignment 

2. Hierarchical progressive alignment 

(a) Reordering of reads to generate guidance tree 

(b) Pairwise and profile-profile alignment 

Each component is designed considering the characteristics of pyroreads and it is 
described in the following sections along with its justification. 

1.1 Semi-Global Alignment 

The first step is to determine the position of each read with respect to the reference 
genome. If this step is omitted, there are number of aUgnments that would be correct, 
but would be inaccurate if analyzed in the global context. A read that is not constricted 
in terms of position, may give the same score (SP score) for the multiple alignment but 
would be incorrect in context of the reference. To accomplish the task of 'placing' the 
reads in the correct context with respect to the reference genome we employ semi-global 
alignment procedure. 

The semi global alignment is also referred to as overlapping aUgnment because 
the sequences are globally aligned ignoring the start and end gaps. For semi-global 
alignment we use a modified version of Needleman-Wunsch algorithm [5]. 

The modification in the basic version of Needleman-Wunsch is required to handle 
the leading and trailing gaps of the reads when aligning to the reference genome. If 
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the leading and trailing gaps are not ignored, considering the short length of the reads, 
the alignment scores would be dominated by these gaps, hence giving an inaccurate 
alignment with respect to the genome. 

Let the two sequences to be aUgned be s and t, and M{i, j) presents the score of 
the optimal aligrmient. Since, we do not wish to penaUze the starting gaps, we mod- 
ify the dynamic programming matrix by initiaUzing the first row and first column 
to be zero. The gaps at the end are also not to be penaUzed. Let M{i,j) represent 
the optimal score of si, - ■ ■ ,Si and ti, - ■ ■ ,tj. Then M{m, j) is the score that repre- 
sents optimally aUgning s with The optimal alignment therefore, is now de- 
tected as the maximum value on the last row or column. Therefore the best score is 
= maXk,i{M{k, n),M{m, I)), and the alignment can be obtained by tracking 
the path from M{i, j) to M(0, 0). For additional details on semi-global ahgnment we 
refer the reader to [8]. 

Once each read has been semi-globally aligned with the reference genome, we ob- 
tain reads with leading and traiUng gaps, where the first character after the gaps is the 
starting position of the read with respect to the reference genome. The information for 
these alignments are stored in hashtables that are further used for processing in reorder- 
ing the reads for ahgnment. 

2 Hierarchical Progressive Alignment 

Generally multiple sequence alignment (MSA) procedures are either based on iterative 
methods or employ progressive techniques. Although, progressive techniques relative 
to iterative techniques are more efficient, they are not suitable when the sequences are 
relatively diverse or the number of sequences is very large. Considering the fact that the 
pyroreads are highly similar, we develop a hierarchical progressive alignment procedure 
that is also computationally efficient for large number of reads. 

Progressive alignment techniques develop final MSA by combining pair-wise align- 
ments beginning with the most similar pair and progressing to the most distantly related. 
All progressive alignment methods require two stages: a first stage in which the rela- 
tionships between the sequences are represented as a tree, called a guide tree, and a 
second step in which MSA is built by adding the sequences sequentially to the grow- 
ing MSA according to the guide tree. In the following, we describe the low complexity 
components of pyw-align. 

2.1 Reordering Reads 

The method followed by most of the progressive multiple alignment algorithms is that a 
quick similarity measure is computed that is based on k-mer counting [4] or some other 
heuristic mechanism. These pair-wise similarity measures (distances) are tabulated in a 
matrix form and a tree is constructed from this distance matrix using UPGMA or neigh- 
boring joining. The progressive ahgnment is thus built, following the branching order 
of the tree, giving a multiple alignment. These steps require 0{N'^) time each, where 
TV is the number of reads. To reduce this complexity, we exploit the fact that the reads 
are coming from the same reference or nearly the same reference. This in turn implies 
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that the reads starting from the same or near same 'starting' point with respect to the 
reference genome are hkely to be similar to each other. Therefore, we already have the 
ordering information or the 'guide tree' from the first step of the algorithm. Our guide 
tree, or the order in which sequences will be aligned in the progressive alignment is 
from the starting position of the reads from the first stage. Of course the decomposition 
of the reads (the subtree of the profiles that we built) doesn't render the reads in the 
same order as in traditional progressive alignment, but nevertheless the order is more or 
less the same when the profiles of these reads are aligned. 

Let there be number of reads R = Ri,R2, - ■ ■ ,Rn generated from pyrosequenc- 
ing technique, from the reference genome of length Lg. Also, let the length of each 
read denoted by L{R)p. After executing semi-global alignment using the algorithm dis- 
cussed in the previous section, let each read be presented by Rpg, where the p*'' read 
has q leading gaps and Lg — q — L{R)p trailing gaps. Then the reordering algorithm 
would reorder the reads such that after the reads are reordered using the information 
from the leading gaps, the read Rpq comes in ordering 'before' y p,q & Lg. 

To execute the reordering in an efficient manner, we employ hashtables that speed 
up the search process. We create two hashtables: hashtablei uses fasta sequence tag 
as the hash key and stores the corresponding starting position of the read; hashtable2 
stores the read names (fasta sequence tag) and the dna sequence it is associated with. 
Using these tables, the reads are reordered in the database in linear time. 

2.2 Pair-wise and Profile-Profile Alignments 

The ordering of the reads determined in the preceding step is now used to conduct the 
progressive alignment. Traditional progressive alignment requires that the sequences 
most similar to each other are aligned first. Thereafter, sequences are added one by one 
to the multiple alignments determined according to some similarity metric. This sequen- 
tial addition of sequences for progressive aUgnment is not suitable for large number of 
sequences. In order to devise a low complexity system, we design a hierarchical pro- 
gressive alignment procedure that is based on domain decomposition [1], as described 
below and depicted in Figure 2. 

First of all, pair-wise local alignment using standard Needle-Wunsch is executed 
on each overlapping pair of reads (the ordering is still the same as discussed in the 
previous section). After this stage, the reads are ahgned in pairs such that we have 
pairs of aligned reads. These N/2 pairs of reads are then used for profile aligrmients as 
discussed below. 

Profile-profile alignments are used to re-align two or more existing alignments(in 
our case the pairs of aligned reads). It is useful for two reasons; one being that the 
user may want to add sequences gradually, and second being that the user may want to 
keep one high quaUty profile fixed and keep on adding sequences aligned to that fixed 
profile [17]. 

We take advantage of both of these properties in our domain decomposition. 

In this stage of the algorithm, the N/2 pairs of ahgned reads have to be combined to 
get a multiple alignment. We have shown in [21] that the decomposition of the profiles 
gives a fair amount of time advantages even on a single processor. Therefore a hierar- 
chical model similar to [1] is implemented (see Fig. 2). The model requires that instead 




Fig. 2. Hierchical profile-profile alignments for pyro-align is shown 



of combining the profiles in a sequential manner (one by one), a binary tree is built such 
that the profiles to be aligned are the leafs of the tree. 
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Fig. 3. Two profiles(X and Y) are aligned under the columns constrains, producing profile Z 

In order to apply pair-wise alignment functions to profiles, a scoring function must 
be defined, similar to the substitution methods defined for pair- wise alignments. One of 
the most commonly used profile functions is the sequence-weighted sum of substitution 
matrix scores for each pair of amino acid letters. Let i and j be the amino acid, pi 
the background probability of i, pij the joint probability of i and j aligned to each 
other, Sij the substitution matrix being used, /f the observed frequency of i in column 
X of the first profile, xq the observed frequency of gaps in that column. The same 
attributes are assumed for the profile y. Profile sum of pairs (PSP) is the function used 
in Clustalw [17], Mafft [23] and Muscle [19] to maximize Sum of Pairs(SP) score, 
which in turn maximizes the alignment score such that the columns in the profiles are 
preserved, as depicted in Fig. 3. The PSP score can be defined as in [24] and [19]; 
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Fig. 4. The final Alignment of the reads 

For our purposes, we will take advantage of PSP functions based on 200 PAM ma- 
trix [25] and the 240 PAM VTML matrix [26]. Some multiple alignment methods im- 
plement different scoring functions such as Log expectation (LE) functions, but for our 
purposes PSP scoring suffices. Profile functions have evolved to be quite complex and 
good discussion on these can be found at [19] and [27]. We use the profile functions 
from the clustalw system. The final alignment from the pyw-align algorithm can be 
seen in Fig. 4. Different steps of the proposed pyro-align Algorithm are outlined below. 



Input: Reads generated from pyrosequencing procedure and Reference Genome 
Output: A Multiple Alignment of Reads is returned 

//Calculate overlapping of each of the reads with respect to the reference Genome 
for (i = l;i < N;i + +) do 

Overlapped-Reads ^ Semi-Global- AUgnment(i?i, Genome) ; 

end 

Reordered-Reads ^ Reordering(Overlapped-Reads) ; 

//Pairwise alignment using standard Needle- Wunsch is exectued, for pairs of 

ordered reads ; 

Pair-wise-aligned ^ Needle- Wunsch(Reordered-Reads) ; 
//Profile-profile alignment is obtained using Sample-align-D strategy 
Final-Alignment ^Profile-Profile-alignment(Pair-wise-aligned) ; 
return Final-Alignment ; 



Algorithm 1: Steps of the Proposed Multiple Sequence Alignment pyro-align 
Algorithm 
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3 Performance Analysis 

As discussed earlier in the paper, the exact solution for multiple alignment is not feasible 
and heuristics are employed. Most of these heuristics perform well in practice but there 
is generally no theoretical justification possible for these heuristics [9]. For pyro-align 
it can be shown that the semi-global alignment of the reads with the reference genome 
is analogous to center star alignment. The center star alignment is shown to give results 
within 2-approx of the optimal alignment [9] in worst case and same can be expected 
from the semi-global alignment of reads with reference genome. The accuracy of the 
later stages is confirmed by rigorous quality assessment procedure described in the 
section below. 

3.1 Experimental Setup and Quality Assessment 

The performance evaluation of the algorithm has been carried on a single desktop sys- 
tem 2x QuadCore Intel 5355 2.66 GHz, 2x4 MB Cache and 16GB of RAM. The oper- 
ating system on the desktop is RedHat Linux with kernel 2.6. 18-92.1. 13. el5. The soft- 
ware uses libraries from Biojava [7] and is built using Java version "1.6.0" Java(TM) 
SE Runtime Environment,lBM J9 VM. 

To investigate the quality of the alignment produced by the algorithm we used Read- 
sim simulator [11] to generate the reads. The quality assessment of multiple alignment 
is generally carried out using benchmarks such Prefab [18] or BaliBase [6]. However, 
these benchmarks are not designed to access the quality of the aligned reads produced 
from pyrosequencing, and there are no benchmarks available specifically for these 
reads. Therefore, a system has to be developed to access the quality of the aligned 
reads. The experimental setup for the quality assessment of the aUgnment procedure is 
shown in the Fig. 5 and is explained below. 

Our quality assessment have two objectives: (l)to assess the quality of the align- 
ment produced by pyro-align with respect to the original genome (2) ensure that the 
system must be able to handle reads from multiple haplotype for alignment. 

To achieve these objectives, we setup the quality assessment system as shown Fig. 5. 
We used a HIV pol gene virus with length of 1970bp as the wildtype for the experi- 
ments. The wildtype is then used to produce 4 sets of genomes, randomly mutated at 
different rate; The four sets of genomes are Dist-003, Dist-005, Dist-007 and Dist-010, 
with mutations of 3%, 5%, 7% and 10%, respectively. Now using the mutated genomes, 
2000 and 5000 reads from the Readsim were generated using standard ReadSim param- 
eters with forward orientation. 

The generated reads from these mutated genomes were then aligned with the wild- 
type.This procedure is adopted because generally scientists only have a wildtype of the 
microbial genomes available and therefore it depicts a more practical scenario. 

After the aUgnment, a majority consensus of the reads is obtained. A distance based 
similarity is then calculated of the consensus obtained from the aligned reads with the 
original genome from which the reads were generated. The results of the alignment ob- 
tained and the accuracy of the consensus thus obtained are shown in Fig. 6 and Fig. 7 
for 2000 and 5000 reads respectively. 
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Fig. 5. The experimental setup for the quality assessment of the multiple alignment program 



We compare the accuracy of the algorithm with two different methods. First being 
the simple pair-wise alignment of the reads with the reference genome. Secondly, we 
compare it with a sequential gap propagation method, used in recent pyrosequencing 
systems [12]. Simply put, gap propagation method builds multiple ahgnment from parr- 
wise alignments by sequentially 'propagating' the gaps from each pairwise aligiraient 
to aU the reads in the system. Propagation of gaps is accomplished for every position 
where at least one read has an inserted base. A gap is inserted in the reference genome 
and, consequently, in all reads that overlap the genome at that position. The complexity 
of the procedure is of the order of 0{N^). 

The accuracy of the consensus obtained using just the pairwise alignment is less 
than 55% and that obtained from the pyro-align is always greater than 96%. An even 
better alignment quality is achieved for greater number of reads, because more number 
of reads provide a better coverage for a genome of given length. The accuracy of the 
gap propagation procedure, is comparable to pyro-align for small mutations, but as the 
mutations increase the accuracy of gap propagation based method decreases. 

To illustrate that the alignment system also works with a 'mixture' of reads from 
different haplotype, we use the mutated reads from Dist-003, Dist-005 and Dist-007 
to generate a new set of reads. The new set contains equal number of reads from the 
mutated sets e.g. 2000 reads from each mutated genome for the results shown. The reads 
are then aligned by the pyro-align algorithm using wildtype as the reference genome. 
The results of alignment for this mixture set are shown in Fig. 8 for Dist-003/Dist-005 
and Dist-005/Dist-007 mixtures. It must be noted here that we don't have a 'groimd 
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Fig. 6. The quality of the alignment using pairwise, pyro-Align and 'propagation' methods for 
2000 reads 



truth' genome in these cases and hence no genome is available to compare the consensus 
obtained from the alignment. 

However, we do know the mutation rates for the genomes from which the mixture 
sets were generated. Therefore, if an optimal alignment of these reads is obtained, the 
'mutation' in the consensus should not be greater than the combined mutations of the 
genomes. For example consider the case of Dist-003/Dist-005 mixture. We know the 
mutation rates for the genomes from which the reads generated were 3% and 5% with 
respect to the wildtype. Therefore, for accurate alignment, the consensus of the align- 
ment should not vary more than 8%, in the worst case, when compared to the wildtype. 
Same would be true for the other cases considered according to the mutation rates of 
the genomes. As can be seen that the results of the alignment compared with the wild- 
type are well within the expected limits. The accuracy of the pairwise alignment of 
the reads with the reference genome(in this case the wildtype), and that obtained using 
propagation method is also shown for comparison. 



4 Complexity Analysis 

In this section we briefly outline the complexity of the proposed pyro-align algorithm. 
Recall the pyro-align algorithm consists of these major steps: semi-global alignment, 
reordering, pair-wise alignment, and profile-profile alignment. 

We assume that the number of reads is N with the average length of the read equal 
to Lff. Let's further assume that the length of the reference genome is equal to Lg. 
Then, the complexity of the semi-global alignment (overlapping alignment) is equal to 
0{N LfiLg). The clustering of the reads can be done in 0{NLg) and the reordering 
using hashtables can be achieved in 0{N), making the total for this stage equal to 
0{NLg + N). The pairwise alignment of the reads is shown to be achieved in 0{NLj^) 
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Fig. 7. The quality of the alignment using pairwise, pyro-Align and 'propagation' methods for 
5000 reads 

and the profile-profile alignment can be achieved in 0{NlogN x L^). This makes the 



total complexity equal to 0{NLjiLg + NLg 
asymptotically equal to 0{NlogN x + 



NlogN Ll). This is 
A^L|j).The advantage of low complexity 



of pyro-align was further evident by our experimentation. We were able to align 2000 
reads of average length 2506p from a genome of length 1970bp in about 12 minutes 
compared to 6 hours of computation using more traditional multiple alignment systems 
such as Clustalw. 



5 Conclusion 

The short reads from the pyrosequencing method are rendered useless if they are not 
multiple aligned for magnitude of important applications, such as haplotype reconstruc- 
tion and error elimination. We have presented an efficient hierarchical procedure to 
multiple align large number of short reads from the pyrosequencing procedure. 

We demonstrated that simple-pair-wise alignment is not feasible in the case of py- 
roreads. We also showed that the proposed method is much faster than traditional time 
consuming multiple alignment methods such as Clustalw or Tcoffee. We also presented 
the quality assessment results and compared those with the results obtained by simple 
pair-wise alignment procedure and 'propagation' methods. 
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